Herath S, Harandi M, Porikli F. Going deeper into action recognition: A survey[J]. Image and Vision Computing, 2017, 60: 4-21.

1. Overview

The paper surveys four families of deep learning architectures for action recognition:

Spatiotemporal Network
Multiple Stream Network
Deep Generative Network
Temporal Coherency Network
It also covers the action recognition datasets (KTH, Weizmann, Hollywood2, HMDB-51, UCF-101, Sports-1M).

2. Spatiotemporal Network

2.1. 3D CNN

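In practice, training the network with complementary information (optical flow) gives better performance. The process of feeding temporal information into the network is called fusion, and there are three mechanisms: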

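slow fusion. Several segments of the video are fed in at the same time. In the foveated architecture, a fovea stream is fed in alongside the context stream (the fovea stream is the center of the image, on the assumption that whoever is filming keeps the important content near the center of the view).
early fusion. A set of adjacent frames is fed in.
late fusion. Frames are processed one at a time, and the features of all frames are fused at the end.
In 3D CNN architectures, feeding longer frame sequences improves results; factorizing a 3D convolution kernel into a 2D kernel and a 1D kernel reduces the number of parameters.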
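
To make the parameter saving from kernel factorization concrete, here is a minimal counting sketch. The layer sizes (64 input and output channels, a 3x3x3 kernel) and the choice of keeping the intermediate channel count equal to the output channel count are assumptions made for illustration, not values taken from the survey.

```python
def conv3d_params(c_in, c_out, kt, kh, kw, bias=True):
    """Weights (plus optional biases) of a full kt x kh x kw 3D convolution."""
    return c_in * c_out * kt * kh * kw + (c_out if bias else 0)


def factorized_params(c_in, c_out, kt, kh, kw, c_mid=None, bias=True):
    """A 2D spatial conv (1 x kh x kw) followed by a 1D temporal conv (kt x 1 x 1)."""
    # Assumption: the intermediate channel count defaults to c_out.
    c_mid = c_out if c_mid is None else c_mid
    spatial = c_in * c_mid * kh * kw + (c_mid if bias else 0)
    temporal = c_mid * c_out * kt + (c_out if bias else 0)
    return spatial + temporal


full = conv3d_params(64, 64, 3, 3, 3)
fact = factorized_params(64, 64, 3, 3, 3)
print(full, fact)  # 110656 49280
```

Under these assumptions the factorized layer needs fewer than half the parameters of the full 3D kernel; the exact ratio depends on the kernel size and the intermediate width.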

2.2. RNN

Features are first extracted with a 3D CNN and then fed into an LSTM.

LRCN（Long-term Recurrent Convolutional Network）
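
The "CNN features, then LSTM" pipeline can be sketched with a plain-Python LSTM cell. This is only an illustration under stated assumptions: the feature vectors are random placeholders standing in for CNN outputs, the weights are random and untrained, and the names `make_lstm`, `lstm_step` and `encode_clip` are hypothetical helpers, not part of LRCN or any library.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_lstm(input_size, hidden_size, seed=0):
    """Random weights for the four gates: input, forget, candidate, output."""
    rng = random.Random(seed)
    cols = input_size + hidden_size
    return {
        gate: {
            "W": [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                  for _ in range(hidden_size)],
            "b": [0.0] * hidden_size,
        }
        for gate in ("i", "f", "g", "o")
    }


def lstm_step(params, x, h, c):
    """One LSTM time step; returns the new hidden state and cell state."""
    z = x + h  # concatenate the current feature with the previous hidden state

    def affine(gate):
        p = params[gate]
        return [sum(w * v for w, v in zip(row, z)) + b
                for row, b in zip(p["W"], p["b"])]

    i = [sigmoid(a) for a in affine("i")]
    f = [sigmoid(a) for a in affine("f")]
    g = [math.tanh(a) for a in affine("g")]
    o = [sigmoid(a) for a in affine("o")]
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def encode_clip(params, features, hidden_size):
    """Run the LSTM over a feature sequence and return the last hidden state."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in features:
        h, c = lstm_step(params, x, h, c)
    return h


# Placeholder "CNN features": 5 time steps, 8 dimensions each (assumed sizes).
rng = random.Random(1)
features = [[rng.random() for _ in range(8)] for _ in range(5)]
lstm = make_lstm(input_size=8, hidden_size=4)
clip_vector = encode_clip(lstm, features, hidden_size=4)
```

In a real model `clip_vector` would be passed to a classification layer, and the CNN and LSTM weights would be learned rather than sampled.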

3. Multiple Stream Network

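Two parallel inputs:

a reference frame
consecutive optical flow
Characteristics
uses ImageNet pre-trained weights
early fusion of the optical flow
multi-task training (because the datasets are small, several datasets are used for training, each with its own classification layer)
Fusing at an intermediate layer both improves results and reduces the number of parameters.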
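
The multi-task idea (a shared network with one classification layer per dataset) can be sketched as follows. The feature dimension, the random untrained weights, and the helper names `make_head` and `classify` are assumptions for illustration; only the class counts (51 for HMDB-51, 101 for UCF-101) come from the datasets themselves.

```python
import random


def linear(weights, x):
    """Apply a bias-free linear layer given as a list of weight rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def make_head(num_classes, feat_dim, rng):
    """A randomly initialized classification layer (untrained placeholder)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(feat_dim)]
            for _ in range(num_classes)]


rng = random.Random(0)
FEAT_DIM = 16  # assumed size of the shared feature vector

# One classification layer per dataset, all fed by the same shared features.
heads = {
    "HMDB-51": make_head(51, FEAT_DIM, rng),
    "UCF-101": make_head(101, FEAT_DIM, rng),
}


def classify(shared_feature, dataset):
    """Route a shared feature vector to the head that belongs to its dataset."""
    scores = linear(heads[dataset], shared_feature)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


feature = [rng.random() for _ in range(FEAT_DIM)]
label_hmdb, scores_hmdb = classify(feature, "HMDB-51")
label_ucf, scores_ucf = classify(feature, "UCF-101")
```

During training, each sample would update the shared layers plus only the head of the dataset it came from.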

4. Deep Generative Models

Generating predictions along the time dimension is an unsupervised problem.

4.1. Dynencoder

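Consists of three layers:

takes the input frame x_t and produces h_t
predicts h_{t+1} from h_t
generates the frame x_{t+1} from h_{t+1}
Each layer is first pre-trained separately, then the whole network is fine-tuned end-to-end.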
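
The three-layer chain x_t -> h_t -> h_{t+1} -> x_{t+1} can be sketched with plain linear maps. This is a simplification under stated assumptions: the dimensions, the random untrained weights, the purely linear layers, and the names `W_enc`, `W_dyn`, `W_dec` are all hypothetical and not taken from the Dynencoder paper.

```python
import random


def make_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


rng = random.Random(0)
FRAME_DIM, HIDDEN_DIM = 12, 4  # assumed sizes

W_enc = make_matrix(HIDDEN_DIM, FRAME_DIM, rng)   # layer 1: x_t -> h_t
W_dyn = make_matrix(HIDDEN_DIM, HIDDEN_DIM, rng)  # layer 2: h_t -> h_{t+1}
W_dec = make_matrix(FRAME_DIM, HIDDEN_DIM, rng)   # layer 3: h_{t+1} -> x_{t+1}


def predict_next_frame(x_t):
    """Chain the three layers to predict the next frame from the current one."""
    h_t = matvec(W_enc, x_t)
    h_next = matvec(W_dyn, h_t)
    return matvec(W_dec, h_next)


def squared_error(pred, target):
    """Prediction loss between the generated frame and the true next frame."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


x_t = [rng.random() for _ in range(FRAME_DIM)]
x_next = [rng.random() for _ in range(FRAME_DIM)]
loss = squared_error(predict_next_frame(x_t), x_next)
```

No labels are involved: the training target is simply the next frame of the same video, which is what makes the problem unsupervised.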

4.2. LSTM Autoencoder

Consists of two layers:

encoder LSTM
decoder LSTM (synthesis and prediction)
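
A small sketch of how one frame sequence could supply both decoder targets. It assumes the decoder reconstructs the encoder input in reverse order and predicts the frames that follow; the function name `make_autoencoder_targets` and the integer "frames" are hypothetical stand-ins for illustration.

```python
def make_autoencoder_targets(frames, num_input, num_future):
    """Split a sequence into encoder input and the two decoder targets."""
    if num_input + num_future > len(frames):
        raise ValueError("sequence too short for the requested split")
    encoder_input = frames[:num_input]
    # Assumption: the reconstruction target is the input in reverse order.
    reconstruction_target = encoder_input[::-1]
    # The prediction target is the block of frames right after the input.
    prediction_target = frames[num_input:num_input + num_future]
    return encoder_input, reconstruction_target, prediction_target


frames = list(range(10))  # stand-in for 10 video frames
enc_in, recon, future = make_autoencoder_targets(frames, num_input=6, num_future=3)
print(enc_in)  # [0, 1, 2, 3, 4, 5]
print(recon)   # [5, 4, 3, 2, 1, 0]
print(future)  # [6, 7, 8]
```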

4.3. Adversarial training

5. Temporal Coherency Network

A form of weak supervision. The model judges whether a sequence of video frames is in the correct temporal order.
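
Training examples for such an order check can be produced from unlabeled video alone. The sketch below is one possible sampling scheme under stated assumptions: tuples of three frame indices, a 50/50 split between ordered and shuffled tuples, and "correct" meaning strictly increasing indices. The function name `sample_order_example` is hypothetical.

```python
import random


def sample_order_example(num_frames, tuple_len=3, rng=None):
    """Return (frame_indices, label); label 1 means temporally ordered."""
    if tuple_len < 2 or tuple_len > num_frames:
        raise ValueError("need 2 <= tuple_len <= num_frames")
    rng = rng or random.Random()
    ordered = sorted(rng.sample(range(num_frames), tuple_len))
    if rng.random() < 0.5:
        return ordered, 1
    shuffled = ordered[:]
    while shuffled == ordered:  # reshuffle until the order really changes
        rng.shuffle(shuffled)
    return shuffled, 0


rng = random.Random(0)
examples = [sample_order_example(30, rng=rng) for _ in range(200)]
```

The labels come for free from the video itself, which is why no action annotations are needed for this stage.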

5.1. Siamese Network

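Judges whether a given sequence is temporally coherent. Compared with ImageNet pre-trained weights, the weights learned this way give more attention to human poses and improve accuracy.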

Drawback. Scene changes can occur between video segments (for example, the advertisements inserted into videos in the Sports-1M dataset).

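5.2. Parallel architecture based on Siamese networks

The video frames are split into two sets:
- precondition set X_p
- effect set X_e
The features of X_p are transformed and compared with the features of X_e in order to recognize the action.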
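
The "transform the X_p features, then compare with the X_e features" step can be sketched as picking the action whose transformation brings the two feature vectors closest. Everything concrete here is an assumption for illustration: one matrix per action, cosine distance as the comparison, and the toy 2D features and action names.

```python
import math


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def recognize(x_p_feat, x_e_feat, transforms):
    """Pick the action whose transform maps the X_p feature nearest to X_e."""
    distances = {
        action: cosine_distance(matvec(T, x_p_feat), x_e_feat)
        for action, T in transforms.items()
    }
    return min(distances, key=distances.get), distances


# Toy per-action transformations (hypothetical, hand-picked for the example).
transforms = {
    "keep": [[1.0, 0.0], [0.0, 1.0]],
    "swap": [[0.0, 1.0], [1.0, 0.0]],
}
action, distances = recognize([1.0, 0.0], [0.0, 1.0], transforms)
print(action)  # swap
```

In a trained model the two feature vectors would come from the two Siamese branches, and the per-action matrices would be learned.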

6. Datasets

6.1. KTH&Weizmann

controlled conditions (limited camera motion, almost zero background clutter)
limited to basic actions (walking, running and jumping)

6.2. HMDB-51&UCF-101

Amateur YouTube videos (contain camera motion (and shakes), view-point variations and resolution inconsistencies)
Actions are well cropped in the temporal domain, so the datasets are not well-suited for measuring the performance of action localization
Include subtle classes (chewing and talking, or playing violin and playing cello) that require the network to understand spatiotemporal cues in depth

6.3. Hollywood2&Sports-1M

View-point changes (view-point/editing complexities)
The action occurs only in a small clip of the video
Sports-1M also contains spectators and banner advertisements

7. Future directions

knowledge transfer & domain adaptation
Combining algorithms (3D CNN, temporal pooling, optical flow frames, LSTM)
Improving performance (data augmentation, foveated architecture, distinct frame sampling strategies)

8. Practical applications

Usually involves joint detection (human keypoint detection)
Fine-grained action recognition, rather than recognizing every category of action